KEGG

KEGG (http://www.kegg.jp/) is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.

Please note that the KEGG parser implementation in Biopython is incomplete. While the KEGG website indicates many flat file formats, only parsers and writers for compound, enzyme, and map are currently implemented. However, a generic parser is implemented to handle the other formats.

Parsing KEGG records

Parsing a KEGG record is as simple as using any other file format parser in Biopython. (Before running the following codes, please open http://rest.kegg.jp/get/ec:5.4.2.2 with your web browser and save it as ec_5.4.2.2.txt.)

In [2]:
!wget http://rest.kegg.jp/get/ec:5.4.2.2 -O ec_5.4.2.2.txt
--2016-01-12 20:44:45--  http://rest.kegg.jp/get/ec:5.4.2.2
Resolving rest.kegg.jp (rest.kegg.jp)... 133.103.200.77
Connecting to rest.kegg.jp (rest.kegg.jp)|133.103.200.77|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘ec_5.4.2.2.txt’

ec_5.4.2.2.txt          [   <=>                ] 105.99K   141KB/s   in 0.8s

2016-01-12 20:44:46 (141 KB/s) - ‘ec_5.4.2.2.txt’ saved [108534]

In [3]:
from Bio.KEGG import Enzyme

records = Enzyme.parse(open("ec_5.4.2.2.txt"))

record = list(records)[0]

record.classname
Out[3]:
['Isomerases;',
 'Intramolecular transferases;',
 'Phosphotransferases (phosphomutases)']
In [4]:
record.entry
Out[4]:
'5.4.2.2'

The following section will shows how to download the above enzyme using the KEGG api as well as how to use the generic parser with data that does not have a custom parser implemented.

Querying the KEGG API

Biopython has full support for the querying of the KEGG api. Querying all KEGG endpoints are supported; all methods documented by KEGG (http://www.kegg.jp/kegg/rest/keggapi.html) are supported. The interface has some validation of queries which follow rules defined on the KEGG site. However, invalid queries which return a 400 or 404 must be handled by the user.

First, here is how to extend the above example by downloading the relevant enzyme and passing it through the Enzyme parser.

In [5]:
from Bio.KEGG import REST

from Bio.KEGG import Enzyme

request = REST.kegg_get("ec:5.4.2.2")

open("ec_5.4.2.2.txt", 'w').write(request.read().decode("utf-8"))
Out[5]:
108534
In [6]:
records = Enzyme.parse(open("ec_5.4.2.2.txt"))

record = list(records)[0]

record.classname
Out[6]:
['Isomerases;',
 'Intramolecular transferases;',
 'Phosphotransferases (phosphomutases)']
In [7]:
record.entry
Out[7]:
'5.4.2.2'

Now, here’s a more realistic example which shows a combination of querying the KEGG API. This will demonstrate how to extract a unique set of all human pathway gene symbols which relate to DNA repair. The steps that need to be taken to do so are as follows. First, we need to get a list of all human pathways. Secondly, we need to filter those for ones which relate to “repair”. Lastly, we need to get a list of all the gene symbols in all repair pathways.

In [8]:
from Bio.KEGG import REST

human_pathways = REST.kegg_list("pathway", "hsa").read()

human_pathways.decode("utf-8").split("\n")[0:5]
Out[8]:
['path:hsa00010\tGlycolysis / Gluconeogenesis - Homo sapiens (human)',
 'path:hsa00020\tCitrate cycle (TCA cycle) - Homo sapiens (human)',
 'path:hsa00030\tPentose phosphate pathway - Homo sapiens (human)',
 'path:hsa00040\tPentose and glucuronate interconversions - Homo sapiens (human)',
 'path:hsa00051\tFructose and mannose metabolism - Homo sapiens (human)']
In [9]:
# Filter all human pathways for repair pathways
repair_pathways = []
for line in human_pathways.decode("utf-8").rstrip().split("\n"):
    entry, description = line.split("\t")
    if "repair" in description:
        repair_pathways.append(entry)

repair_pathways
Out[9]:
['path:hsa03410', 'path:hsa03420', 'path:hsa03430']
In [10]:
# Get the genes for pathways and add them to a list
repair_genes = []
for pathway in repair_pathways:
    pathway_file = REST.kegg_get(pathway).read()  # query and read each pathway

    # iterate through each KEGG pathway file, keeping track of which section
    # of the file we're in, only read the gene in each pathway
    current_section = None
    for line in pathway_file.decode("utf-8").rstrip().split("\n"):
        section = line[:12].strip()  # section names are within 12 columns
        if not section == "":
            current_section = section

        if current_section == "GENE":
            gene_identifiers, gene_description = line[12:].split("; ")
            gene_id, gene_symbol = gene_identifiers.split()

            if not gene_symbol in repair_genes:
                repair_genes.append(gene_symbol)

print("There are %d repair pathways and %d repair genes. The genes are:" % \
        (len(repair_pathways), len(repair_genes)))
print(", ".join(repair_genes))
There are 3 repair pathways and 78 repair genes. The genes are:
OGG1, NTHL1, NEIL1, NEIL2, NEIL3, UNG, TDG, SMUG1, MUTYH, MPG, MBD4, APEX1, APEX2, POLB, POLL, HMGB1, XRCC1, PCNA, POLD1, POLD2, POLD3, POLD4, POLE, POLE2, POLE3, POLE4, LIG1, LIG3, PARP2, PARP1, PARP3, PARP4, FEN1, RBX1, CUL4B, CUL4A, DDB1, DDB2, XPC, RAD23B, RAD23A, CETN2, ERCC8, ERCC6, CDK7, MNAT1, CCNH, ERCC3, ERCC2, GTF2H5, GTF2H1, GTF2H2, GTF2H2C_2, GTF2H2C, GTF2H3, GTF2H4, ERCC5, BIVM-ERCC5, XPA, RPA1, RPA2, RPA3, RPA4, ERCC4, ERCC1, RFC1, RFC4, RFC2, RFC5, RFC3, SSBP1, PMS2, MLH1, MSH6, MSH2, MSH3, MLH3, EXO1

The KEGG API wrapper is compatible with all endpoints. Usage is essentially replacing all slashes in the url with commas and using that list as arguments to the corresponding method in the KEGG module. Here are a few examples from the api documentation (http://www.kegg.jp/kegg/docs/keggapi.html).

/list/hsa:10458+ece:Z5100            -> REST.kegg_list(["hsa:10458", "ece:Z5100"])
/find/compound/300-310/mol_weight    -> REST.kegg_find("compound", "300-310", "mol_weight")
/get/hsa:10458+ece:Z5100/aaseq      -> REST.kegg_get(["hsa:10458", "ece:Z5100"], "aaseq")